Mtmd implementation #1261

SignalRT · 2025-09-27T16:47:24Z

Prototype implementation:

Minimally tested on macOS.
Tested unsuccessfully with CUDA 13 (seems to be an issue in llama.cpp itself).
Unit test
The test does not render images.

Copilot

Pull Request Overview

This PR implements a comprehensive migration from the existing LLaVA multimodal architecture to a new MTMD (Multi-Modal Text+Data) implementation. The change introduces a more unified approach to handling multimodal inputs (images, audio, video) by replacing specialized LLaVA components with generic MTMD helpers that support multiple media types through a consistent tokenization and evaluation pipeline.

Migration from LLaVA-specific classes to generic MTMD wrapper classes
Introduction of new native API surface for MTMD tokenization and chunk-based evaluation
Updated executors to use MTMD tokenization instead of direct image embedding evaluation
Comprehensive test coverage for the new MTMD functionality

Reviewed Changes

Copilot reviewed 41 out of 41 changed files in this pull request and generated 6 comments.


File	Description
SafeMtmdWeights.cs	New wrapper class for MTMD multimodal weights replacing LLavaWeights
NativeApi.Mtmd.cs	Native P/Invoke surface for MTMD helper functions
SafeMtmdModelHandle.cs	Native handle management for MTMD models with tokenization and evaluation
SafeMtmdInputChunks.cs	Managed wrapper for native chunk collections returned by tokenizer
SafeMtmdInputChunk.cs	Individual chunk wrapper with metadata access and token span views
SafeMtmdEmbed.cs	Media embedding wrapper supporting images, audio, and raw data buffers
LLamaInteractExecutor.cs	Updated interactive executor to use MTMD tokenization workflow
LLamaInstructExecutor.cs	Updated instruct executor with MTMD preprocessing logic
BatchedExecutor.cs	Added MTMD batch evaluation support for batched inference
Conversation.cs	Extended conversation class with multimodal prompting and media queueing


LLama/Native/NativeApi.cs

LLama/Native/MtmdContextParams.cs

LLama/SafeMtmdWeights.cs

LLama/Native/SafeMtmdModelHandle.cs

Copilot · 2025-09-28T15:53:12Z

LLama/LLamaInteractExecutor.cs

+                if (inferenceParams.MaxTokens == 0)
+                {
+                    _embeds.Clear();
+                    args.WaitForInput = true;
+                    args.ReturnValue = false;
+                    return;
+                }


This MaxTokens == 0 check and its logic are duplicated in InstructExecutor. Consider extracting them into a shared method in the base class.

LLama/Batched/Conversation.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

martindevans · 2025-10-19T16:14:40Z

LLama/Batched/BatchedExecutor.cs

    : IDisposable
 {
    private int _nextSequenceId;
-    private readonly List<IBatch> _batchQueue = [];


It looks like the changes from #1262 have been undone here? Probably an accident that requires rebasing?

martindevans · 2025-10-19T16:17:35Z

LLama/Batched/BatchedExecutor.cs

        }
    }
+
+    private class MtmdChunkBatch : IBatch


Previously, the batched executor worked with llava like this: llava embedded the image, and you could then prompt with the raw embeddings (EmbeddingBatch). That was nice, since it kept the batched executor totally independent of llava. Is that no longer possible with MTMD?

martindevans · 2025-10-19T16:24:33Z

LLama/Batched/Conversation.cs

    /// </summary>
    private bool _forked;
+    private readonly List<SafeMtmdEmbed> _mtmdEmbeds = new();
+    private int? _mtmdLogitsIndex;


This only ever seems to be set to -1 or null?

martindevans · 2025-10-19T16:27:48Z

LLama/Batched/Conversation.cs

+        _mtmdLogitsIndex = null;
+    }
+
+    public void QueueMedia(SafeMtmdEmbed embed)


Could we add an overload of Prompt that takes a Span<SafeMtmdEmbed> instead of enqueuing them? This removes some state and makes the ownership of the embed objects clearer (e.g. at the moment you could call QueueMedia, then dispose the media, which would trigger an error on the next call to Prompt when you tried to use that disposed object).

martindevans · 2025-10-19T16:30:37Z

LLama/Batched/ConversationExtensions.cs

-        return sampler.Sample(conversation.Executor.Context.NativeHandle, conversation.GetSampleIndex(offset));
+        var ctx = conversation.Executor.Context.NativeHandle;
+        if (conversation.MtmdLogitsIndex == -1)
+            return sampler.Sample(ctx, -1);


This looks very odd to me - if you had multiple active conversations and prompted them all with an image then next time you would be sampling from index=-1 for all of the conversations! Maybe that's right, it just caught my eye.

martindevans · 2025-10-19T16:32:35Z

LLama/Native/MtmdContextParams.cs

+        };
+    }
+
+    private static string? PtrToString(IntPtr ptr)


Can this be extracted into a helper? It's duplicated elsewhere at the moment.
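
For illustration, a minimal sketch of what a single shared helper could look like. The class name `NativeStringHelper` is hypothetical and not part of the PR; the method name and null-on-zero behaviour follow the `PtrToStringUtf8` doc comment quoted from NativeApi.Mtmd.cs. The manual scan for the terminator is an assumption made so the sketch does not depend on `Marshal.PtrToStringUTF8`, which is only available on newer .NET targets.

```csharp
#nullable enable
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical shared helper; the name is an assumption for illustration only.
public static class NativeStringHelper
{
    // Converts a null-terminated UTF-8 native string into a managed string.
    // Returns null when the pointer is zero.
    public static string? PtrToStringUtf8(IntPtr ptr)
    {
        if (ptr == IntPtr.Zero)
            return null;

        // Find the length of the string by scanning for the terminating zero byte.
        var length = 0;
        while (Marshal.ReadByte(ptr, length) != 0)
            length++;

        if (length == 0)
            return string.Empty;

        // Copy the bytes into managed memory and decode them as UTF-8.
        var buffer = new byte[length];
        Marshal.Copy(ptr, buffer, 0, length);
        return Encoding.UTF8.GetString(buffer);
    }
}
```

Both call sites (the private `PtrToString` in MtmdContextParams.cs and the public method in NativeApi.Mtmd.cs) could then forward to one implementation like this.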

martindevans · 2025-10-19T16:32:49Z

LLama/Native/NativeApi.Mtmd.cs

+    /// Convert a UTF-8 encoded native string pointer into a managed <see cref="string"/>.
+    /// Returns <c>null</c> when the pointer is zero.
+    /// </summary>
+    public static string? PtrToStringUtf8(IntPtr ptr)


See other comment about duplicated code

martindevans · 2025-10-19T16:34:26Z

LLama/Native/SafeMtmdEmbed.cs

+    /// Managed wrapper around <c>mtmd_bitmap*</c> resources. Instances own the native pointer
+    /// and ensure proper cleanup when disposed.
+    /// </summary>
+    public sealed class SafeMtmdEmbed : IDisposable


Since this owns a native resource it should probably derive from SafeHandle
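
A rough sketch of the SafeHandle pattern being suggested, under stated assumptions: `ExampleNativeBufferHandle` is a hypothetical type, and `Marshal.AllocHGlobal`/`Marshal.FreeHGlobal` stand in for the native allocate/free pair so the example runs on its own. A real wrapper would call the corresponding native free function in `ReleaseHandle` instead.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical handle type; AllocHGlobal/FreeHGlobal are stand-ins for the
// native allocate/free functions that a real wrapper would P/Invoke.
public sealed class ExampleNativeBufferHandle : SafeHandle
{
    private ExampleNativeBufferHandle()
        : base(IntPtr.Zero, ownsHandle: true)
    {
    }

    // Allocates the stand-in native resource and wraps it in a handle.
    public static ExampleNativeBufferHandle Allocate(int size)
    {
        var result = new ExampleNativeBufferHandle();
        result.SetHandle(Marshal.AllocHGlobal(size));
        return result;
    }

    // A zero pointer means there is nothing to release.
    public override bool IsInvalid => handle == IntPtr.Zero;

    // Called by the runtime at most once, on Dispose or finalization,
    // and only when the handle is valid.
    protected override bool ReleaseHandle()
    {
        Marshal.FreeHGlobal(handle);
        return true;
    }
}
```

Compared with a hand-written IDisposable, the base class takes care of finalization and of releasing the pointer only once, and P/Invoke signatures can accept the handle type directly.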

martindevans · 2025-10-19T16:34:56Z

LLama/Native/SafeMtmdInputChunk.cs

+/// underlying native pointer (when created via <see cref="Copy"/>) or act as non-owning views
+/// produced by the tokenizer.
+/// </summary>
+public sealed class SafeMtmdInputChunk : IDisposable


Since this owns a native resource it should probably derive from SafeHandle

martindevans · 2025-10-19T16:35:16Z

LLama/Native/SafeMtmdInputChunks.cs

+/// <summary>
+/// Managed lifetime wrapper around a native <c>mtmd_input_chunks</c> collection returned by the tokenizer.
+/// </summary>
+public sealed class SafeMtmdInputChunks : IDisposable


Since this owns a native resource it should probably derive from SafeHandle

martindevans · 2025-10-19T16:37:13Z

LLama/SafeMtmdWeights.cs

+/// <summary>
+/// Lightweight wrapper around the MTMD native context and its helpers.
+/// </summary>
+public sealed class SafeMtmdWeights : IDisposable


Rename to MtmdWeights for consistency with LLamaWeights?

martindevans

Thanks for all the hard work putting this together! Lots of small review nitpicks, but overall this looks really solid 👍

This was referenced Sep 27, 2025

System.DllNotFoundException: 'Unable to load DLL 'llava_shared' or one of its dependencies: The specified module could not be found. (0x8007007E)' #1255 (Open)

Multimodal embedding #1193 (Open)

Mtmd Implementation base

9931d0e

SignalRT force-pushed the mtmd_implementation branch from fcce175 to 9931d0e Compare September 28, 2025 14:54

SignalRT requested a review from Copilot September 28, 2025 15:51

Copilot AI reviewed Sep 28, 2025


SignalRT and others added 2 commits September 29, 2025 21:56

Update LLama/Native/NativeApi.cs

c307006

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Resolve comment: SciSharp#1261 (comment)

3c92b07

SignalRT mentioned this pull request Sep 30, 2025

Qwen2.5-VL gguf model output garbled code #1194 (Open)

SignalRT added 2 commits October 5, 2025 13:47

Remove duplicate code

384ec34

Move common logic to LlamaExecutorBase

d5aab12

SignalRT marked this pull request as ready for review October 5, 2025 12:27

SignalRT mentioned this pull request Oct 6, 2025

[BUG]: Error in version 0.25.0 - LLama.Exceptions.RuntimeError: Failed to load the native library. #1275 (Open)

Merge remote-tracking branch 'upstream/master' into mtmd_implementation

7c49483

SignalRT requested a review from martindevans October 19, 2025 16:05

martindevans reviewed Oct 19, 2025


martindevans requested changes Oct 19, 2025

